Analogy: Highway system. Each service has a lane capacity (quota). If one lane jams (throttle), traffic backs up. You need to know each road's speed limit and plan alternate routes.
Know the quotas for every service you use
Focus on trade-offs between services (speed vs cost vs reliability)
Load test with production-like traffic patterns
Monitor in production and tune continuously
Stay current with service updates (quotas increase over time)
API Gateway - Managing Scale
Feature | What It Does | Analogy
Account Quota | 10,000 requests/sec across all APIs (default) | Highway speed limit for your region
Burst Capacity | 5,000 requests immediate burst (token bucket) | Passing lane - short bursts allowed
Stage Throttling | Per-stage rate/burst limits | Speed limit per road segment
Route/Method | Per-route throttle (HTTP) or per-method (REST) | Speed bumps on specific streets
Usage Plans | Per-client throttle + monthly quota (API keys) | Toll pass with monthly limit per driver
Response Cache | REST APIs: cache responses to reduce backend calls
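To make the rate and burst rows concrete, here is a minimal token-bucket sketch in Python. It is a toy model for intuition only, not API Gateway's actual implementation; the class name `TokenBucket` and its methods are made up for this illustration, and time is passed in explicitly so the example is deterministic.

```python
class TokenBucket:
    """Toy token bucket: `rate` tokens refill per second, up to `burst` tokens."""

    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # the bucket starts full
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` is admitted."""
        # Refill for the elapsed time, never beyond the burst capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a throttled request; API Gateway answers these with HTTP 429


# Default figures from the table: 10,000 requests/sec rate, 5,000 burst.
bucket = TokenBucket(rate=10_000, burst=5_000)

# 6,000 requests arrive in the same instant: only the burst is admitted.
spike = [bucket.allow(now=0.0) for _ in range(6_000)]
admitted, throttled = spike.count(True), spike.count(False)

# Half a second later the bucket has refilled (0.5 s x 10,000/s = 5,000 tokens).
after_refill = [bucket.allow(now=0.5) for _ in range(100)]

print(admitted, throttled, all(after_refill))
```

The point of the sketch is that the burst value bounds how many requests can land at once, while the rate value bounds sustained traffic.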
Provisioned = reserved seats on a plane (always yours). On-demand = standby (might wait). Throttled = flight is full, come back later.
Function Duration Impacts Concurrency & Cost
Duration | 10 req/sec needs | 100 req/sec needs | Insight
100ms | 1 concurrent | 10 concurrent | Low concurrency, low cost
1 sec | 10 concurrent | 100 concurrent | Moderate
10 sec | 100 concurrent | 1,000 concurrent | Hits default quota!
60 sec | 600 concurrent | 6,000 concurrent | Way over default limit
Concurrency = seats in a restaurant. If each customer stays 10 minutes, you need 10 seats for 1 customer/minute. If they stay 60 minutes, you need 60 seats for the same rate. Shorter functions = fewer seats needed = cheaper.
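The duration table and the restaurant analogy both come from one multiplication: concurrency = requests per second x duration in seconds. The sketch below recomputes the table in Python; the function name `required_concurrency` and the `DEFAULT_QUOTA` constant (the default limit of 1,000 quoted in these notes) are labels chosen for this example.

```python
import math

DEFAULT_QUOTA = 1_000  # default concurrency limit, as quoted in these notes


def required_concurrency(requests_per_sec: int, duration_ms: int) -> int:
    """Concurrent executions needed to sustain a request rate at a given duration."""
    return math.ceil(requests_per_sec * duration_ms / 1000)


for duration_ms in (100, 1_000, 10_000, 60_000):
    for rate in (10, 100):
        needed = required_concurrency(rate, duration_ms)
        flag = "  <- at or over default quota" if needed >= DEFAULT_QUOTA else ""
        print(f"{duration_ms:>6} ms @ {rate:>3} req/sec -> {needed:>5} concurrent{flag}")
```

Durations are taken in milliseconds so the arithmetic stays on whole numbers.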
Scaling with Sync & Async Sources
Sync = drive-through (you wait in line, feel every delay). Async = online order (submit and go, they process when ready).
Scaling with SQS Event Source
Lambda starts with 5 poller processes
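Adds up to 300 concurrent invocations per minute as queue depth grows (60 per minute before the November 2023 scaling update)
Scales up to 1,250 concurrent invocations per event source mapping for standard queues (1,000 before that update), or your reserved limit if lower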
Scales DOWN if error rate is too high (backoff protection)
Maximum Concurrency setting caps scaling per event source mapping
Tuning for Scale
Setting | Impact on Scale
Batch size (1-10,000) | Larger batch = fewer invocations needed
Batch window (0-5 min) | Wait to fill batch = fewer invocations, higher latency
Visibility timeout | Set to at least 6x the function timeout to avoid duplicate processing
Max concurrency | Cap to protect downstream services
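As a rough illustration of how these four settings fit together, the Python sketch below derives a visibility timeout from the 6x rule, estimates invocation counts for a given batch size, and assembles the matching event source mapping parameters. The helper names and sample values are hypothetical; the dictionary keys (`BatchSize`, `MaximumBatchingWindowInSeconds`, `ScalingConfig.MaximumConcurrency`) are the parameter names used by boto3's `create_event_source_mapping`, but nothing here calls AWS.

```python
import math


def visibility_timeout_sec(function_timeout_sec: int) -> int:
    """The 6x rule from the table: leave room for retries before redelivery."""
    return 6 * function_timeout_sec


def invocations_needed(messages: int, batch_size: int) -> int:
    """Best case, assuming every batch is completely full."""
    return math.ceil(messages / batch_size)


def mapping_settings(batch_size: int, batch_window_sec: int, max_concurrency: int) -> dict:
    """Keyword arguments for an SQS event source mapping (sample values only)."""
    return {
        "BatchSize": batch_size,
        # A batch size above 10 needs a batching window of at least 1 second.
        "MaximumBatchingWindowInSeconds": batch_window_sec,
        # Caps concurrent invocations from this queue to protect downstream services.
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }


print(visibility_timeout_sec(30))       # function timeout of 30 s
print(invocations_needed(10_000, 10))   # small batches
print(invocations_needed(10_000, 100))  # larger batches, fewer invocations
print(mapping_settings(batch_size=100, batch_window_sec=5, max_concurrency=50))
```

In a real deployment these keyword arguments would be passed along with the queue ARN and function name, and the visibility timeout would be set on the queue itself.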
Scaling with Kinesis Data Streams
Enhanced Fan-Out for Multiple Consumers
Feature | Standard Consumer | Enhanced Fan-Out
Throughput | 2 MB/sec per shard SHARED across all consumers | 2 MB/sec per shard PER consumer (dedicated)
Delivery | Pull (poll-based) | Push (SubscribeToShard)
Latency | ~200ms avg | ~70ms avg
Best for | 1-2 consumers, cost-sensitive | 3+ consumers, low latency critical
Standard = shared TV antenna (split signal gets weaker per viewer). Enhanced fan-out = dedicated cable line per household (full bandwidth each).
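The antenna-versus-cable comparison reduces to simple arithmetic on the 2 MB/sec per-shard read limit. The Python sketch below shows the per-consumer share in each mode; the function name is invented for this example, and it assumes standard consumers split the shard's read throughput evenly, which is a simplification.

```python
SHARD_READ_MB_PER_SEC = 2.0  # per-shard read throughput from the table


def read_mb_per_consumer(consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput each consumer can expect from a single shard."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced_fan_out:
        return SHARD_READ_MB_PER_SEC          # dedicated pipe per consumer
    return SHARD_READ_MB_PER_SEC / consumers  # one pipe shared by everyone


for n in (1, 2, 4):
    print(n, read_mb_per_consumer(n, False), read_mb_per_consumer(n, True))
```

With four standard consumers each one gets about 0.5 MB/sec per shard, which is why the table suggests enhanced fan-out from roughly three consumers upward.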
Metrics That Indicate Scaling Issues
Service | Metric | What It Means
Lambda | Throttles | Hitting concurrency limit - request a quota increase or add reserved concurrency
Lambda | Duration (p99) | Approaching timeout - optimize or increase memory
Lambda | ConcurrentExecutions | Near quota - time to request increase
SQS | ApproximateAgeOfOldestMessage | Growing = processing can't keep up
SQS | ApproximateNumberOfMessagesVisible | Queue depth growing = add concurrency
Kinesis | IteratorAge (Lambda metric; GetRecords.IteratorAgeMilliseconds on the stream) | Growing = consumer falling behind producer
Kinesis | ReadProvisionedThroughputExceeded | Need more shards or enhanced fan-out
API GW | 4XXError / 5XXError | Clients being throttled (429) or backend failing
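One way to act on these metrics is a CloudWatch alarm. The Python sketch below only builds the parameters for an alarm on ApproximateAgeOfOldestMessage; the function name, queue name, threshold, and evaluation settings are placeholder assumptions, and nothing is sent to AWS. The keys follow boto3's `put_metric_alarm` parameters.

```python
def oldest_message_alarm(queue_name: str, max_age_sec: int) -> dict:
    """Alarm parameters: fire when the oldest message stays too old for 5 minutes."""
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,            # one datapoint per minute
        "EvaluationPeriods": 5,  # five consecutive breaching minutes
        "Threshold": max_age_sec,
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarm = oldest_message_alarm("orders-queue", max_age_sec=300)
print(alarm["MetricName"], alarm["Threshold"])

# With credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

The same shape works for the other rows by swapping the namespace, metric name, and dimensions.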
What's New (2024-2025)
Concurrency Scaling Rate - Per-function limit on how fast concurrency can ramp up (1,000 execution environments every 10 seconds), replacing the old account-level burst model
SnapStart - Reduces cold start latency at scale by restoring from a pre-initialized snapshot (Java/Python/.NET)
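For a feel of what the concurrency scaling rate means during a spike, the Python sketch below estimates how long a single function needs to ramp to a target concurrency. It assumes the documented rate of 1,000 execution environments per 10 seconds per function and treats the ramp as linear, which is a simplification; the function name is invented for this example.

```python
def seconds_to_reach(target: int, current: int = 0,
                     step: int = 1_000, interval_sec: int = 10) -> float:
    """Rough ramp time, assuming `step` new environments every `interval_sec` seconds."""
    needed = max(0, target - current)
    return needed * interval_sec / step


print(seconds_to_reach(3_000))               # cold ramp from zero
print(seconds_to_reach(3_000, current=2_500))  # already mostly warm
```

Requests arriving faster than this ramp are throttled even when the account quota still has headroom.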
Q1: What is the formula for Lambda concurrency?
B) Requests/sec x Duration (sec)
10 req/sec with 1s duration = 10 concurrent. Shorter functions = less concurrency needed = less cost.
A: Memory doesn't affect concurrency count. C: That's Kinesis-specific. D: Those are config, not concurrency formula.
Q2: A Lambda function with SQS source is hitting throttles. What should you do FIRST?
C) Request concurrency quota increase
If the function is throttling, you have hit the account's regional concurrency limit (default 1,000). Request an increase via Service Quotas.
A: Memory doesn't affect concurrency quota. B: Max Concurrency LIMITS scaling (opposite of what you want). D: DLQ handles failures, unrelated to throttle.
Q3: How do you increase Kinesis stream processing throughput?
B) Add shards + increase parallelization factor
More shards = more ingest capacity + more concurrent Lambda instances. Parallelization (1-10) multiplies concurrency per shard.
A: Memory helps speed per invocation, not throughput at stream level. C: API GW isn't involved with Kinesis. D: SQS is a different source.
Q4: Why does async invocation reduce throttle impact on clients?
B) Client gets 202; Lambda handles retries internally
The client doesn't wait or see the throttle. Lambda queues the event and retries for up to 6 hours until concurrency is available.
A: Same concurrency limits apply. C: Same function, same speed. D: IAM always applies.
Live Demo: Load Testing & Observing Scale
Step 1: Deploy a function with reserved concurrency
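A possible way to script the reserved-concurrency part of this step with boto3 is sketched below in Python. The function name `my-load-test-fn` and the value 10 are placeholders, not values from the demo; `put_function_concurrency` is the Lambda API call that sets reserved concurrency. The client is passed in so the helper can be tried against a stub without touching AWS.

```python
def set_reserved_concurrency(lambda_client, function_name: str, reserved: int) -> dict:
    """Reserve (and thereby cap) concurrency for one function."""
    if reserved < 0:
        raise ValueError("reserved concurrency must be zero or greater")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )


# With credentials configured, usage could look like this (placeholder names):
#   import boto3
#   client = boto3.client("lambda")
#   set_reserved_concurrency(client, "my-load-test-fn", 10)
```

A low reserved value makes throttling easy to trigger and observe during the load test.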